[proof of concept] Add support for balancing at the directory level #47
This script works great for datasets that consist mainly of large files, where execution time is dominated by the time spent copying data between disks. However, when working with very large numbers of small files, effective throughput is very low, because most of the time is spent creating and waiting on extra processes (`grep`, `cp`, `rm`) and writing to stdout.
When trying to rebalance a large number of small files, I've found ~100x speedups by copying a directory with `cp -rax` rather than rebalancing each file individually (a rough sketch of the contrast is below). This PR is a proof of concept of adding support for this kind of rebalancing. If there's interest in adding this kind of functionality to this script, I can clean it up and get it into a state to be merged.
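To make the overhead concrete, here is a minimal sketch of the two approaches; the paths and the `.balance` suffix are illustrative assumptions, and the real script also checksums each copy before replacing the original:

```bash
# Per-file approach: forks cp and mv once per file, so process startup
# dominates when there are millions of small files.
find /pool/dataset/dir_b -type f | while IFS= read -r file; do
    cp -ax "$file" "${file}.balance"
    mv "${file}.balance" "$file"
done

# Directory-level approach: one recursive copy and one swap for the
# whole tree, avoiding the per-file process overhead.
cp -rax /pool/dataset/dir_b /pool/dataset/dir_b.balance
rm -rf /pool/dataset/dir_b
mv /pool/dataset/dir_b.balance /pool/dataset/dir_b
```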
The goal here is to allow easy rebalancing in cases where most data is in large numbers of small files, especially where an entire dataset is too large to duplicate, but individual folders are known to be small enough to copy without filling the pool.
As an example, consider a folder structure like the following, where the pool has 1 TB of capacity (the sizes shown are illustrative; the point is that the dataset as a whole is too large to duplicate, while each child fits):
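```
/pool/dataset          # ~600 GB used on a 1 TB pool (sizes illustrative)
├── dir_a/             # a handful of large files, ~150 GB
├── dir_b/             # millions of small files, ~150 GB
└── huge_file          # one very large file, ~300 GB
```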
As the pool is more than half full, we can't call `cp` on the whole dataset, or we'd run out of space. Calling this script on `/pool/dataset` would be too slow, because tracking and copying the files in `dir_b` would have too much overhead. You could call the script on `huge_file` and `dir_a` separately, but then you'd still have to find a way to rebalance `dir_b`.

On this branch, you can invoke the script with `--explicit-paths /pool/dataset/*`, which will make copies of `dir_a`, `dir_b`, and `huge_file` one at a time, without running out of space.
## How it works

The new behavior can be used by passing `--explicit-paths` to the script.
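For example, a hypothetical invocation might look like the following (the script filename here is an assumption; the `--explicit-paths` flag is what this PR adds):

```bash
# Rebalance a single directory in place (script name assumed).
./zfs-inplace-rebalancing.sh --explicit-paths /pool/dataset/dir_b
```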
With the `--explicit-paths` flag set, it also supports passing multiple paths, either explicitly or via globbing, as in the above example.
Instead of using `find` to generate a list of files to rebalance, the script directly uses the list of paths provided in the arguments. The `rebalance` function has been updated to handle copying and removing both directories and files (a sketch of that branching follows). Otherwise, the logic is largely unchanged; in particular, the handling of multiple passes is unchanged.
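As a hedged sketch of that update (the function name and `.balance` suffix are assumptions, not the PR's actual diff, and the real script also verifies each copy before deleting the original), the copy-and-replace step could branch like this:

```bash
# Hypothetical sketch of a directory-aware copy/replace step.
copy_and_replace() {
    local path="$1"
    local tmp="${path}.balance"   # assumed temporary-copy naming
    if [ -d "$path" ]; then
        # Directories: one recursive copy restricted to this filesystem.
        cp -rax "$path" "$tmp"
        rm -rf "$path"
    else
        # Regular files: same per-file flow as before.
        cp -ax "$path" "$tmp"
        rm -f "$path"
    fi
    mv "$tmp" "$path"
}
```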
## Limitations

There are a few limitations to this approach, both in general and in what has been done so far in this branch.
### Limitations to the idea in general
### Limitations in this branch (that should be fixed before merging)